AITopics | dna data storage

Clustering Billions of Reads for DNA Data Storage

Neural Information Processing Systems, Nov-21-2025, 15:57:03 GMT

Storing data in synthetic DNA offers the possibility of improving information density and durability by several orders of magnitude compared to current storage technologies. However, DNA data storage requires a computationally intensive process to retrieve the data. In particular, a crucial step in the data retrieval pipeline involves clustering billions of strings with respect to edit distance. Datasets in this domain have many notable properties, such as containing a very large number of small clusters that are well-separated in the edit distance metric space. In this regime, existing algorithms are unsuitable because of either their long running time or low accuracy. To address this issue, we present a novel distributed algorithm for approximately computing the underlying clusters.

algorithm, dna data storage, name change, (6 more...)

Neural Information Processing Systems

Technology:

Information Technology > Information Management (0.69)
Information Technology > Artificial Intelligence > Machine Learning (0.40)

Clustering Billions of Reads for DNA Data Storage

Cyrus Rashtchian, Konstantin Makarychev, Miklos Racz, Siena Ang, Djordje Jevdjic, Sergey Yekhanin, Luis Ceze, Karin Strauss

Neural Information Processing Systems, Nov-21-2025, 12:18:43 GMT

Neural Information Processing Systems http://nips.cc/

artificial intelligence, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Country:

Asia > Afghanistan > Parwan Province > Charikar (0.04)
South America > Peru > Cusco Department > Cusco Province > Cusco (0.04)
North America > United States > New York (0.04)
(3 more...)

Genre: Workflow (0.46)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Trace Reconstruction with Language Models

Weindel, Franziska; Girsch, Michael; Heckel, Reinhard

arXiv.org Artificial Intelligence, Jul-18-2025

The general trace reconstruction problem seeks to recover an original sequence from its noisy copies independently corrupted by deletions, insertions, and substitutions. This problem arises in applications such as DNA data storage, a promising storage medium due to its high information density and longevity. However, errors introduced during DNA synthesis, storage, and sequencing require correction through algorithms and codes, with trace reconstruction often used as part of the data retrieval process. In this work, we propose TReconLM, which leverages language models trained on next-token prediction for trace reconstruction. We pretrain language models on synthetic data and fine-tune on real-world data to adapt to technology-specific error patterns. TReconLM outperforms state-of-the-art trace reconstruction algorithms, including prior deep learning approaches, recovering a substantially higher fraction of sequences without error.
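
The channel described in this abstract can be illustrated with a short simulation. The following is a minimal sketch, not part of TReconLM: it only generates noisy copies (traces) of a sequence under independent deletions, insertions, and substitutions. The function names `corrupt` and `make_traces` and the default error probabilities are assumptions chosen for illustration and do not come from the paper.

```python
import random

ALPHABET = "ACGT"


def corrupt(seq, p_del, p_ins, p_sub, rng):
    """Return one noisy copy of seq under an illustrative error model."""
    out = []
    for base in seq:
        # Insertion: with probability p_ins, emit a random base before this one.
        if rng.random() < p_ins:
            out.append(rng.choice(ALPHABET))
        r = rng.random()
        if r < p_del:
            # Deletion: drop the base entirely.
            continue
        if r < p_del + p_sub:
            # Substitution: replace the base with a different one.
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)


def make_traces(seq, n, p_del=0.05, p_ins=0.05, p_sub=0.05, seed=0):
    """Generate n independently corrupted traces of seq (hypothetical defaults)."""
    rng = random.Random(seed)
    return [corrupt(seq, p_del, p_ins, p_sub, rng) for _ in range(n)]
```

A trace reconstruction algorithm would take the list returned by `make_traces` as input and try to recover `seq`. Fixing `seed` makes the synthetic traces reproducible.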

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

arXiv:2507.12927

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.95)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Reviews: Clustering Billions of Reads for DNA Data Storage

Neural Information Processing Systems, Oct-8-2024, 06:23:05 GMT

The paper presents a solution to a new type of clustering problem that has emerged from studies of DNA-based storage. Information is encoded within DNA sequences and retrieved using short-read sequencing technology. The short-read sequencer creates multiple short overlapping sequence reads, and these have to be clustered to establish whether they come from the same place in the original sequence. The clustering problem is characterized by clusters that are fairly tight in terms of edit distance (25 max diameter here, which seems quite broad given current sequencing error rates) but well separated from each other (the distance between them is much larger than the diameter). I thought this was an interesting and timely application.

dna data storage, error rate, substitution error rate, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.83)

Clustering Billions of Reads for DNA Data Storage

Cyrus Rashtchian, Konstantin Makarychev, Miklos Racz, Siena Ang, Djordje Jevdjic, Sergey Yekhanin, Luis Ceze, Karin Strauss

Neural Information Processing Systems, Oct-4-2024, 02:16:31 GMT

Neural Information Processing Systems http://nips.cc/

accuracy, algorithm, dataset, (13 more...)

Neural Information Processing Systems

Country:

Asia > Afghanistan > Parwan Province > Charikar (0.04)
South America > Peru > Cusco Department > Cusco Province > Cusco (0.04)
North America > United States > New York (0.04)
(3 more...)

Genre: Workflow (0.46)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Clustering Billions of Reads for DNA Data Storage

Rashtchian, Cyrus; Makarychev, Konstantin; Racz, Miklos; Ang, Siena; Jevdjic, Djordje; Yekhanin, Sergey; Ceze, Luis; Strauss, Karin

Neural Information Processing Systems, Feb-15-2020, 19:28:13 GMT

Storing data in synthetic DNA offers the possibility of improving information density and durability by several orders of magnitude compared to current storage technologies. However, DNA data storage requires a computationally intensive process to retrieve the data. In particular, a crucial step in the data retrieval pipeline involves clustering billions of strings with respect to edit distance. Datasets in this domain have many notable properties, such as containing a very large number of small clusters that are well-separated in the edit distance metric space. In this regime, existing algorithms are unsuitable because of either their long running time or low accuracy.

algorithm, dna data storage, edit distance, (2 more...)

Neural Information Processing Systems

Technology:

Information Technology > Information Management (0.69)
Information Technology > Artificial Intelligence > Machine Learning (0.47)

Clustering Billions of Reads for DNA Data Storage

Rashtchian, Cyrus; Makarychev, Konstantin; Racz, Miklos; Ang, Siena; Jevdjic, Djordje; Yekhanin, Sergey; Ceze, Luis; Strauss, Karin

Neural Information Processing Systems, Dec-31-2017

Storing data in synthetic DNA offers the possibility of improving information density and durability by several orders of magnitude compared to current storage technologies. However, DNA data storage requires a computationally intensive process to retrieve the data. In particular, a crucial step in the data retrieval pipeline involves clustering billions of strings with respect to edit distance. Datasets in this domain have many notable properties, such as containing a very large number of small clusters that are well-separated in the edit distance metric space. In this regime, existing algorithms are unsuitable because of either their long running time or low accuracy. To address this issue, we present a novel distributed algorithm for approximately computing the underlying clusters. Our algorithm converges efficiently on any dataset that satisfies certain separability properties, such as those coming from DNA data storage systems. We also prove that, under these assumptions, our algorithm is robust to outliers and high levels of noise. We provide empirical justification of the accuracy, scalability, and convergence of our algorithm on real and synthetic data. Compared to the state-of-the-art algorithm for clustering DNA sequences, our algorithm simultaneously achieves higher accuracy and a 1000x speedup on three real datasets.
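
To make the problem setting concrete, here is a minimal sketch of clustering strings with respect to edit distance. It is a naive sequential baseline, not the distributed algorithm the abstract describes: it compares each read against one representative per cluster, which is far too slow for billions of reads. The names `edit_distance` and `greedy_cluster` and the `radius` parameter are assumptions introduced for illustration. The sketch relies on the regime the abstract names, where clusters are small and well separated, so a single radius threshold can tell them apart.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the standard two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def greedy_cluster(reads, radius):
    """Put each read in the first cluster whose representative is within radius."""
    clusters = []  # list of (representative, members) pairs
    for read in reads:
        for rep, members in clusters:
            if edit_distance(read, rep) <= radius:
                members.append(read)
                break
        else:
            # No representative is close enough: start a new cluster.
            clusters.append((read, [read]))
    return [members for _, members in clusters]
```

For example, `greedy_cluster(["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTGGGG"], radius=2)` groups the first two reads together and the last two together, because reads within a group differ by one edit while the groups are at least six edits apart.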

artificial intelligence, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Country: North America > United States (0.28)

Genre: Workflow (0.66)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.34)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

To make better computers, researchers look to microbiology

Christian Science Monitor | Science, Mar-3-2017, 04:55:02 GMT

March 2, 2017: Computer engineers have created some amazingly small devices, capable of storing entire libraries of music and movies in the palm of your hand. But geneticists say Mother Nature can do even better. DNA, where all of biology's information is stored, is incredibly dense. The whole genome of an organism fits into a cell that is invisible to the naked eye. That's why computer scientists are turning to microbiology to design the next best way to store humanity's ever-increasing collection of digital data.

artificial intelligence, dna, erlich, (16 more...)

Christian Science Monitor | Science

Country:

North America > United States > New York (0.05)
North America > United States > Massachusetts (0.05)
North America > United States > California > San Francisco County > San Francisco (0.05)
Europe > United Kingdom (0.05)

Genre: Research Report (0.30)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology: Information Technology > Artificial Intelligence (0.83)

To make better computers, researchers turn to microbiology

Christian Science Monitor | Science, Mar-2-2017, 23:40:14 GMT

March 2, 2017: Computer engineers have created some amazingly small devices, capable of storing entire libraries of music and movies in the palm of your hand. But geneticists say Mother Nature can do even better. DNA, where all of biology's information is stored, is incredibly dense. The whole genome of an organism fits into a cell that is invisible to the naked eye. That's why computer scientists are turning to microbiology to design the next best way to store humanity's ever-increasing collection of digital data.

artificial intelligence, dna, erlich, (16 more...)

Christian Science Monitor | Science

Country: